Data Analysis

Week 7: Generalised Linear Models part 2

Jafet Belmont

School of Mathematics and Statistics

ILOs

By the end of this session you will be able to:

  • Fit and interpret the output of a logistic regression with grouped binary data.

  • Fit and interpret the coefficients of a nominal logistic model.

Overview of today’s session

Required R packages

Before we proceed, load all the packages needed for this week:

library(tidyverse)
library(ggplot2)
library(sjPlot)
library(broom)
library(performance)
library(yardstick)

First example - Turtles sex determination

This week we will continue with logistic regression, now modelling grouped binary outcomes (e.g. the number of successes out of a fixed number of trials).

The turtle data set contains the number of hatched male and female turtles across different temperatures, with 3 independent replicates for each temperature.

turtles = turtles %>% 
  mutate(totals = male+female,
         male_props = male/totals)
temp male female totals male_props
27.2 1 9 10 0.10
27.2 0 8 8 0.00
27.2 1 8 9 0.11
27.7 7 3 10 0.70
27.7 4 2 6 0.67
27.7 6 2 8 0.75

Logistic regression for grouped data in R

\[ \begin{align}y_i &\sim \mathrm{Binomial}(n_i,p_i)\\\mathrm{logit}(p_i) &= \beta_0 +\beta_1 \times \mathrm{temperature}.\end{align} \]

model_turtles <- glm(male_props ~ temp,
                     data = turtles,
                     weights =  totals,
                     family = binomial)
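Note that glm() also accepts grouped binomial data as a two-column matrix of successes and failures, which is equivalent to the proportion-plus-weights form above. A runnable sketch, rebuilding the six rows of the turtle data displayed earlier:

```r
# Rebuild the six rows displayed above (a subset of the full turtle data)
turtles_sub <- data.frame(
  temp   = c(27.2, 27.2, 27.2, 27.7, 27.7, 27.7),
  male   = c(1, 0, 1, 7, 4, 6),
  female = c(9, 8, 8, 3, 2, 2)
)

# Proportion of males, weighted by the number of eggs in each group ...
turtles_sub$totals     <- turtles_sub$male + turtles_sub$female
turtles_sub$male_props <- turtles_sub$male / turtles_sub$totals
m1 <- glm(male_props ~ temp, data = turtles_sub,
          weights = totals, family = binomial)

# ... is equivalent to a two-column response of (successes, failures)
m2 <- glm(cbind(male, female) ~ temp, data = turtles_sub,
          family = binomial)

all.equal(coef(m1), coef(m2))  # both specifications give identical estimates
```

The second form is the one used to label some of the output tables later in these slides.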

Interpreting the log-odds

  male props
Predictors Log-Odds CI p
(Intercept) -61.32 -86.83 – -39.73 <0.001
temp 2.21 1.44 – 3.13 <0.001
Observations 15
  • For every unit increase in Temperature, the log-odds of a male being hatched increase by 2.21.

  • For every unit increase in Temperature, the odds of hatching a male are multiplied by \(\mathrm{exp}(\beta_1) = 9.13\), i.e. they are 9.13 times the odds at a temperature one unit lower.

  • If an egg is incubated at a temperature of 27.5 degrees, what are the chances (odds) of a female being hatched, i.e. \(\mathrm{Odds}(male=0|temp = 27.5)\)?

  • What is the probability that a turtle egg incubated at a temperature of 28.5 degrees becomes a male?
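As a sketch of how these two questions can be answered from the reported estimates (plugging the rounded coefficients into the inverse logit by hand, so the values are approximate):

```r
b0 <- -61.32  # intercept (log-odds scale), from the table above
b1 <- 2.21    # temperature coefficient

# Odds of a female at 27.5 degrees: the inverse of the odds of a male
odds_male   <- exp(b0 + b1 * 27.5)
odds_female <- 1 / odds_male         # approx 1.72

# Probability of a male at 28.5 degrees via the inverse logit
eta    <- b0 + b1 * 28.5
p_male <- exp(eta) / (1 + exp(eta))  # approx 0.84
```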

Modelling grouped binary data with a categorical covariate

To illustrate how the previous model works with categorical predictors, we can discretise the temperature values into arbitrary categories as follows:

\[ \mathrm{temperature~ category} =\begin{cases} \text{high} & \mathrm{temperature} > 29^\circ C \\ \text{medium} & 28^\circ C < \mathrm{temperature} \leq 29^\circ C \\ \text{low} & \text{otherwise}\end{cases} \]

Code
turtles = turtles  %>% mutate(
  temp_fct = case_when(
    temp > 29 ~ "high",
    temp > 28 ~ "medium",
    .default = "low"
  ) %>% as.factor()
) 

Model fitting

  • Changing the baseline (reference) category to low.
turtles = turtles %>% mutate(temp_fct = relevel(temp_fct,ref = "low")) 
  • The Model

\[ \begin{align} y_i &\sim \mathrm{Binomial}(n_i,p_i)\\ \mathrm{logit}(p_i) &= \alpha + \beta_{1} \times \mathbb{I}_{\mathrm{temperature}}(\mathrm{high}) + \beta_{2} \times \mathbb{I}_{\mathrm{temperature}}(\mathrm{medium}). \end{align} \]

  • \(\alpha\) represents the log-odds of a male turtle being hatched at a low incubation temperature.

  • \(\mathbb{I}_{\mathrm{temperature}}(\mathrm{high})\) is an indicator variable that takes the value 1 if the \(i\)th experimental replicate was conducted at a high temperature, and 0 otherwise.

  • \(\mathbb{I}_{\mathrm{temperature}}(\mathrm{medium})\) is an indicator variable that takes the value 1 if the \(i\)th experimental replicate was conducted at a medium temperature, and 0 otherwise.

  • \(\beta_1\) is the change in the log-odds of a male turtle being hatched when incubated at a high temperature compared to a low one.

  • \(\beta_2\) is the change in the log-odds of a male turtle being hatched when incubated at a medium temperature compared to a low one.
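The slides only show the fitted estimates, but the corresponding glm() call has the same grouped-binomial form as before, with temp_fct as the covariate. A minimal runnable sketch on hypothetical grouped counts (the real fit would use the turtles data):

```r
# Hypothetical grouped counts spanning the three temperature categories
# (for illustration only -- not the real turtle data)
toy <- data.frame(
  temp_fct = factor(rep(c("low", "medium", "high"), each = 2),
                    levels = c("low", "medium", "high")),
  male     = c(1, 2, 6, 5, 9, 8),
  female   = c(9, 8, 4, 5, 1, 2)
)

# Same grouped-binomial specification, now with a categorical covariate
fit <- glm(cbind(male, female) ~ temp_fct, data = toy, family = binomial)
exp(coef(fit))  # estimates on the odds(-ratio) scale, as in the tables below
```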

Model fitting

The model estimates on the odds scale with 95% confidence intervals

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.59 0.29 -1.80 0.07 0.33 1.04
temp_fcthigh 45.47 1.06 3.61 0.00 8.59 843.97
temp_fctmedium 6.32 0.44 4.23 0.00 2.76 15.31
  • The odds of a male being hatched when incubated at a low temperature are 0.59 times the odds of a female being hatched under the same condition.
    • In other words, the odds of a female being hatched at a low temperature are \(\mathrm{exp}(\alpha)^{-1} =\) 1.68 times the odds of a male being hatched at a low temperature.
  • However, there is not enough evidence that this difference is statistically significant, since the confidence interval (0.33, 1.04) contains 1 (remember we are on the odds scale).


Model fitting

  • The odds of a male being hatched in a high-temperature setting are 45.47 times those in a low-temperature setting:

\[ \dfrac{\text{Odds(male=1|temp=high)}}{\text{Odds(male=1|temp=low)}}= \exp(\hat{\beta}_1) = 45.47 \]

Model fitting

  • Similarly, the odds of a male being hatched in a medium-temperature condition are 6.32 times those in a low-temperature one:

\[ \dfrac{\text{Odds(male=1|temp=medium)}}{\text{Odds(male=1|temp=low)}}= \exp(\hat{\beta}_2) = 6.32 \]

Model fitting

The model estimates on the odds scale with 95% confidence intervals

  cbind(male,female)
Predictors Odds Ratios CI p
(Intercept) 0.59 0.33 – 1.04 0.072
temp fct [high] 45.47 8.59 – 843.97 <0.001
temp fct [medium] 6.32 2.76 – 15.31 <0.001
  • The odds of a male being hatched are 45.47 times higher in a high-temperature setting than in a low-temperature one.

  • The odds of a male being hatched are 6.32 times higher in a medium-temperature condition than in a low one.

  • What if we want to compare the odds of a male being hatched if the egg was incubated at a high temperature against a medium one?

\[ \dfrac{\text{Odds(male=1|temp=high)}}{\text{Odds(male=1|temp=medium)}} \]
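Since both odds ratios share the same baseline (low), this comparison reduces to the ratio of the two reported odds ratios:

\[ \dfrac{\text{Odds(male=1|temp=high)}}{\text{Odds(male=1|temp=medium)}}= \dfrac{\exp(\alpha+\beta_1)}{\exp(\alpha+\beta_2)} = \exp(\beta_1-\beta_2) = \dfrac{45.47}{6.32} \approx 7.19 \]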

Models for multiple categorical responses

We will look at logistic regression models applied to nominal (unordered) responses with more than two categories.

  • The basis for modelling categorical data with more than two categories is the multinomial distribution for a random variable \(Y\) with \(J\) categories:

\[ f(\mathbf{y} | n)=\frac{n!}{y_1! y_2! \dots y_J!}p_1^{y_1}p_2^{y_2}\dots p_J^{y_J}. \qquad(1)\]

  • \(p_1,p_2,\dots, p_J\) are the probabilities associated with each of the \(J\) categories, with \(\sum_{j=1}^J p_j = 1\).

  • Suppose we have \(n\) independent observations, from which we obtain

    \[ \begin{aligned} y_1 &\text{ outcomes in category 1}\\ y_2 &\text{ outcomes in category 2}\\ &\vdots\\ y_J &\text{ outcomes in category }J \end{aligned} \]

    \(\Rightarrow \sum_{j=1}^J y_j = n\) and \(\mathbf{y}=(y_1,y_2,\dots,y_J)^\intercal \sim \mathrm{Multinom}(n;p_1,\ldots,p_J)\)
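In R, multinomial probabilities of this form can be evaluated with the base function dmultinom(). For example, with \(n=10\), \(J=3\) and \(p=(0.2, 0.3, 0.5)\):

```r
# P(Y1 = 3, Y2 = 2, Y3 = 5) for n = 10 trials with p = (0.2, 0.3, 0.5),
# i.e. (10! / (3! 2! 5!)) * 0.2^3 * 0.3^2 * 0.5^5
dmultinom(x = c(3, 2, 5), prob = c(0.2, 0.3, 0.5))  # approx 0.0567
```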

Nominal logistic regression

  • The goal is to estimate the probabilities for each category based on some explanatory variables \(\mathbf{x}\).

  • The probability of the \(j\)th category is then given by:

\[ P(Y=j|\mathbf{x})= \dfrac{\mathrm{exp}(\mathbf{x}^\intercal \boldsymbol{\beta}_j)}{\sum_{k=1}^J\mathrm{exp}(\mathbf{x}^\intercal \boldsymbol{\beta}_k)} \]

  • As with logistic regression, one category is arbitrarily chosen as the reference category, and all other categories are compared with it.

  • If category \(J\) is chosen as the reference category, the log-odds for the other categories relative to the reference are:

\[ \mathrm{log}\left(\dfrac{P(Y=j|\mathbf{x})}{P(Y=J|\mathbf{x})}\right) = \mathrm{log}\left(\dfrac{p_j}{p_J}\right) =\mathbf{x}^\intercal\boldsymbol{\beta}_j, ~~\text{for } j=1,\dots,J-1. \]

Nominal logistic regression

We can easily derive the estimated probabilities for each class:

  • For the reference class \(J:\)\[ \hat{p}_J = \dfrac{1}{1+\sum_{j=1}^{J-1} \mathrm{exp}(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_j)} \]

  • For class \(j = 1,2,\ldots,J-1:\) \[\hat{p}_j=\dfrac{\exp(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_j)}{1+\sum_{k=1}^{J-1}\exp(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_k)}.\]

Fitting a nominal logistic regression in R

In this example we look at data on subjects that were interviewed about the importance of various features when buying a car.

  • sex: woman/man
  • age: 18-23, 24-40, >40
  • response: no/little, important, very important
  • frequency: number of people interviewed in each group

Fitting a nominal logistic regression in R

We can fit the following nominal logistic regression model using the multinom() function from library(nnet):

library(nnet)
model_cars <- multinom(response ~ age + sex, weights = frequency, data = dcars)

\[ \mathrm{log}\left(\frac{p_j}{p_1} \right) = \beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\beta_{3j}x_3, ~~ j=2,3 \]

where

  • \(j=1\) for “no/little importance” (the reference category)
  • \(j=2\) for “important”
  • \(j=3\) for “very important”
  • \(x_1=1\) for women and 0 for men,
  • \(x_2=1\) for age 24-40 years and 0 otherwise
  • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.

Interpreting the output

Let's look at the model summary and interpret the coefficients.

  response
Predictors Log-Odds CI p Response
(Intercept) -0.98 -1.55 – -0.41 0.003 important
age24-40 1.13 0.37 – 1.89 0.008 important
age [> 40] 1.59 0.69 – 2.49 0.003 important
sex [women] 0.39 -0.28 – 1.06 0.226 important
(Intercept) -1.85 -2.59 – -1.12 <0.001 very important
age24-40 1.48 0.58 – 2.37 0.004 very important
age [> 40] 2.92 1.97 – 3.86 <0.001 very important
sex [women] 0.81 0.10 – 1.53 0.030 very important
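These log-odds can be turned into fitted class probabilities using the formulas from the previous section. A hand-computed sketch for a woman aged over 40, using the rounded estimates in the table (so the values are approximate):

```r
# Log-odds vs the "no/little" reference for a woman aged > 40
# (x1 = 1 woman, x2 = 0, x3 = 1 for age > 40)
eta_important      <- -0.98 + 0.39 * 1 + 1.13 * 0 + 1.59 * 1  # = 1.00
eta_very_important <- -1.85 + 0.81 * 1 + 1.48 * 0 + 2.92 * 1  # = 1.88

denom <- 1 + exp(eta_important) + exp(eta_very_important)
p_no_little      <- 1 / denom                        # approx 0.10
p_important      <- exp(eta_important) / denom       # approx 0.26
p_very_important <- exp(eta_very_important) / denom  # approx 0.64

p_no_little + p_important + p_very_important  # probabilities sum to 1
```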

Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

y.level term estimate std.error statistic p.value conf.low conf.high
important (Intercept) -0.98 0.26 -3.82 0.0 -1.48 -0.48
important age24-40 1.13 0.34 3.30 0.0 0.46 1.80
important age> 40 1.59 0.40 3.94 0.0 0.80 2.38
important sexwomen 0.39 0.30 1.29 0.2 -0.20 0.98

\[\begin{aligned} \mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right) &= \hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} x_1 + \hat{\beta}_{2,\text{important}} x_2 + \hat{\beta}_{3,\text{important}} x_3 \\ =&-0.98 +0.39 x_1+ 1.13 x_2+1.59x_3, \end{aligned}\]

  • \(x_1=1\) for women and 0 for men

  • \(x_2=1\) for age 24-40 years and 0 otherwise

  • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.


Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

\[ \require{cancel} \begin{aligned} \mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right) &= \color{#27B0F5}{\hat{\beta}_{0,\text{important}}} + \hat{\beta}_{1,\text{important}} \cancelto{0}{x_1} + \hat{\beta}_{2,\text{important}} \cancelto{0}{x_2} + \hat{\beta}_{3,\text{important}} \cancelto{0}{x_3} \\ =&\color{#27B0F5}{-0.98} + \cancelto{0}{0.39 x_1}+ \cancelto{0}{1.13 x_2}+\cancelto{0}{1.59x_3} = -0.98 \end{aligned} \]

  • \(x_1=1\) for women and 0 for men

  • \(x_2=1\) for age 24-40 years and 0 otherwise

  • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.

Let's break this down:

  • What is \(\hat{\beta}_{0,\text{important}}\)?

\(\hat{\beta}_{0,\text{important}} = -0.98\) represents the log-odds of considering the features important vs no/little important for men aged 18-23, since these are our two reference categories (this is the case when \(x_{1} = x_2 = x_3 = 0\)).
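On the odds scale, this intercept corresponds to

\[ \exp(\hat{\beta}_{0,\text{important}}) = \exp(-0.98) \approx 0.38, \]

i.e. for men aged 18-23 the odds of rating the features important rather than no/little important are roughly 0.38, so the no/little answer is the more likely of the two.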


Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

\[\begin{aligned} \mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right) &= \hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} \color{purple}{x_1} + \hat{\beta}_{2,\text{important}} \color{green}{x_2} + \color{red}{\hat{\beta}_{3,\text{important}} \cancelto{0}{x_3}} \\ &= -0.98 + \color{purple}{0.39} + \color{green}{1.13} + \color{red}{\cancelto{0}{1.59~x_3}} \approx ~0.54 \end{aligned}\]

  • \(x_1=1\) for women and 0 for men

  • \(x_2=1\) for age 24-40 years and 0 otherwise

  • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.

Let's break this down:

  • Each \(\hat{\beta}_{i,j}\) then shows how the log-odds change when moving from the reference categories to the other levels.

  • For example, what are the log-odds of considering the feature important vs no/little important for women aged 24-40?


Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

\[\begin{aligned} \exp\left\{\mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right)\right\} &= \exp\{\hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}}\cancelto{0}{x_1} + \hat{\beta}_{2,\text{important}} \cancelto{0}{x_2} + \hat{\beta}_{3,\text{important}} x_3\} \\ &= \exp\{-0.98 + \color{gray}{0.39 (0)}+ \color{gray}{1.13 (0)}+ 1.59(1)\} \approx ~ 1.82 \end{aligned}\]

  1. What are the odds that a man over 40 years old considers power steering and air conditioning important vs no/little important? i.e.,

\[\text{Odds}(\text{Feature=important|Age > 40, Gender = male})\]


Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

\[\begin{aligned} \exp\left\{\mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right)\right\} &= \exp\{\hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} x_1 + \hat{\beta}_{2,\text{important}} x_2 + \hat{\beta}_{3,\text{important}} x_3\} \\ &= \exp\{-0.98 + 0.39 x_1+ 1.13 x_2+ 1.59x_3\} \end{aligned}\]

  2. What are the odds that a person over 40 years old considers power steering and air conditioning important vs no/little important, compared to a person of the same gender aged 18-23? i.e.,

\[\begin{aligned} \dfrac{\text{Odds}(\text{Feature = important|Age > 40})}{\text{Odds}(\text{Feature = important|Age = 18-23})} &= \dfrac{\exp\{\hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} x_1 + \hat{\beta}_{2,\text{important}} \cancelto{0}{x_2} + \hat{\beta}_{3,\text{important}} x_3\}}{\exp\{\hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} x_1 + \hat{\beta}_{2,\text{important}} \cancelto{0}{x_2} + \hat{\beta}_{3,\text{important}} \cancelto{0}{x_3}\}} \\ &= \dfrac{\exp\{\cancel{\color{red}{\hat{\beta}_{0,\text{important}}}} + \cancel{\color{red}{\hat{\beta}_{1,\text{important}} x_1}} + \hat{\beta}_{3,\text{important}} x_3\}}{\exp\{\cancel{\color{red}{\hat{\beta}_{0,\text{important}}}} + \cancel{\color{red}{\hat{\beta}_{1,\text{important}} x_1}} \}} \\ & = \exp\{ \hat{\beta}_{3,\text{important}} \} = \exp\{1.59\} \end{aligned} \]

The odds of a person over 40 considering this feature important are \(\exp(1.59) \approx 4.89\) times those of a person in the 18-23 age category (the baseline for age).

Examples: Interpreting the output

Let's focus on the comparison important vs no/little important

\[\begin{aligned} \exp\left\{\mathrm{log}\left(\frac{\hat{p}_{\text{important}}}{\hat{p}_{\text{no/little}}} \right)\right\} &= \exp\{\hat{\beta}_{0,\text{important}} + \hat{\beta}_{1,\text{important}} x_1 + \hat{\beta}_{2,\text{important}} x_2 + \hat{\beta}_{3,\text{important}} x_3\} \\ &= \exp\{-0.98 + 0.39 x_1+ 1.13 x_2+ 1.59x_3\} \end{aligned}\]

  3. What are the odds that a woman aged 24-40 considers power steering and air conditioning important vs no/little important, compared to a man of the same age?

\[\begin{aligned} \dfrac{\text{Odds}(\text{Feature=important| Gender = female})}{\text{Odds}(\text{Feature=important| Gender = male})} &= \dfrac{\exp\{-0.98 + \color{purple}{0.39 x_1}+ \color{green}{1.13 x_2}+ 1.59 \color{red}{\cancelto{0}{x_3}}\}}{\exp\{-0.98 + 0.39 \color{red}{\cancelto{0}{x_1}}+ \color{green}{1.13 x_2}+ 1.59\color{red}{\cancelto{0}{x_3}}\}}\\ &= \dfrac{\exp(\color{grey}{-0.98}+ 0.39 + \color{grey}{1.13} )}{\exp(\color{grey}{-0.98} + \color{grey}{1.13})}=\exp(0.39) \approx 1.47 \end{aligned} \]